
    RITA: a Study on Scaling Up Generative Protein Sequence Models

    In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.
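
    A minimal sketch of how such a model is typically used for zero-shot fitness estimation: a candidate sequence is scored by its log-likelihood under the autoregressive model, and variants are ranked by that score. The checkpoint name and tokenizer handling below are assumptions (the released RITA checkpoints on the Hugging Face Hub may require trust_remote_code and their own tokenizer setup), not a prescription from the paper.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "lightonai/RITA_s"  # assumed checkpoint name; substitute as needed

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    model.eval()

    def sequence_log_likelihood(seq: str) -> float:
        """Sum of per-token log-probabilities; higher means more 'natural' to the model."""
        ids = tokenizer(seq, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Next-token prediction: logits at position t score the token at position t+1.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

    # Rank sequence variants by log-likelihood as a zero-shot fitness proxy.
    variants = ["MKTAYIAKQR", "MKTAYIAKQW", "MKTAYIAKQA"]
    print(sorted(variants, key=sequence_log_likelihood, reverse=True))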

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3B and 7.5B parameter language models trained on it.
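
    The filter-then-deduplicate recipe can be illustrated with a toy sketch. The real pipeline relies on far richer heuristics and fuzzy (MinHash) deduplication, so the thresholds and rules below are purely illustrative assumptions.

    import hashlib
    import re

    def passes_quality_filter(doc: str) -> bool:
        """Crude stand-ins for document-level filtering heuristics (assumed thresholds)."""
        words = doc.split()
        if len(words) < 50:                                   # too short to be useful
            return False
        if sum(len(w) for w in words) / len(words) > 12:      # implausibly long average word
            return False
        if len(re.findall(r"[^\x00-\x7F]", doc)) / max(len(doc), 1) > 0.3:
            return False                                      # mostly non-text bytes
        return True

    def exact_dedup(docs):
        """Keep only the first occurrence of each whitespace-normalized document."""
        seen, kept = set(), []
        for doc in docs:
            key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        return kept

    corpus = ["a raw CommonCrawl document " * 20, "a raw CommonCrawl document " * 20, "tiny"]
    cleaned = exact_dedup([d for d in corpus if passes_quality_filter(d)])
    print(f"{len(corpus)} documents -> {len(cleaned)} after filtering and deduplication")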

    LightOn Optical Processing Unit: Scaling-up AI and HPC with a Non von Neumann co-processor

    We introduce LightOn's Optical Processing Unit (OPU), the first photonic AI accelerator chip available on the market for at-scale Non von Neumann computations, reaching 1500 TeraOPS. It relies on a combination of free-space optics with off-the-shelf components, together with a software API allowing a seamless integration within Python-based processing pipelines. We discuss a variety of use cases and hybrid network architectures, with the OPU used in combination with CPUs/GPUs, and draw a pathway towards "optical advantage". Comment: Proceedings of IEEE Hot Chips 33, 2021.
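
    As a rough illustration of the computation such an accelerator performs, the sketch below simulates a random projection followed by an intensity measurement, y = |Wx|^2 with W a fixed complex Gaussian matrix. The matrix model and dimensions are assumptions for illustration, not the device's measured transfer function.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out = 1_000, 10_000      # such devices target very large output dimensions

    # Fixed random transmission matrix, played by a scattering medium in hardware.
    W = (rng.standard_normal((d_out, d_in)) + 1j * rng.standard_normal((d_out, d_in))) / np.sqrt(2)

    def opu_like_projection(x: np.ndarray) -> np.ndarray:
        """Nonlinear random features: the intensity of a random complex projection."""
        return np.abs(W @ x) ** 2

    x = rng.standard_normal(d_in)
    features = opu_like_projection(x)
    print(features.shape)            # (10000,), usable as random features downstream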

    What Language Model to Train if You Have One Million GPU Hours?

    The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. In the process of building BLOOM--the Big Science Large Open-science Open-access Multilingual language model--our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience. Comment: Findings of EMNLP 2022.
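
    A back-of-the-envelope version of the budgeting question posed above can be written down explicitly. The throughput and utilization figures below are illustrative assumptions rather than numbers taken from the paper, and the 6*N*D cost rule is the usual approximation for dense Transformer training.

    GPU_HOURS = 1_000_000
    PEAK_TFLOPS = 312            # A100 bf16 peak throughput
    UTILIZATION = 0.30           # assumed end-to-end fraction of peak actually sustained

    total_flops = GPU_HOURS * 3600 * PEAK_TFLOPS * 1e12 * UTILIZATION

    # Training an N-parameter dense Transformer on D tokens costs roughly 6*N*D FLOPs.
    for n_params in (1e9, 13e9, 176e9):
        d_tokens = total_flops / (6 * n_params)
        print(f"{n_params / 1e9:>6.0f}B params -> ~{d_tokens / 1e9:,.0f}B tokens within budget")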

    Memory consolidation in the cerebellar cortex

    Several forms of learning, including classical conditioning of the eyeblink, depend upon the cerebellum. In examining mechanisms of eyeblink conditioning in rabbits, reversible inactivations of the control circuitry have begun to dissociate aspects of cerebellar cortical and nuclear function in memory consolidation. It was previously shown that post-training cerebellar cortical, but not nuclear, inactivations with the GABA(A) agonist muscimol prevented consolidation, but these findings left open the question of how final memory storage was partitioned across cortical and nuclear levels. Memory consolidation might be essentially cortical and directly disturbed by actions of the muscimol, or it might be nuclear, and sensitive to the raised excitability of the nuclear neurons following the loss of cortical inhibition. To resolve this question, we simultaneously inactivated cerebellar cortical lobule HVI and the anterior interpositus nucleus of rabbits during the post-training period, thereby protecting the nuclei from the disinhibitory effects of cortical inactivation. Consolidation was impaired by these simultaneous inactivations. Because direct application of muscimol to the nuclei alone has no impact upon consolidation, we can conclude that post-training consolidation processes and memory storage for eyeblink conditioning have critical cerebellar cortical components. The findings are consistent with a recent model suggesting that the distribution of learning-related plasticity across cortical and nuclear levels is task-dependent. There can be transfer to nuclear or brainstem levels for the control of high-frequency responses, but learning with lower-frequency response components, such as in eyeblink conditioning, remains mainly dependent upon cortical memory storage.

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
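
    A minimal usage sketch with the openly released checkpoints: the 176B model itself requires multi-GPU serving, so the smaller family member used below is an assumption chosen to keep the example self-contained.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "bigscience/bloom-560m"   # small member of the BLOOM family

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    # BLOOM is multilingual, so prompting in French (or another of its 46 natural languages) works too.
    prompt = "Le modèle de langue BLOOM a été entraîné"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))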

    Linear Optical Random Projections Without Holography

    We introduce what we believe to be a novel method to perform linear optical random projections without the need for holography. Our method consists of a computationally trivial combination of multiple intensity measurements to mitigate the information loss usually associated with the absolute-square non-linearity imposed by optical intensity measurements. Both experimental and numerical findings demonstrate that the resulting matrix consists of real-valued, independent, and identically distributed (i.i.d.) Gaussian random entries. Our optical setup is simple and robust, as it does not require interference between two beams. We demonstrate the practical applicability of our method by performing dimensionality reduction on high-dimensional data, a common task in randomized numerical linear algebra with relevant applications in machine learning.
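
    One way to see how intensity-only measurements can be combined into a linear projection is the polarization identity |A(x+r)|^2 - |A(x-r)|^2 = 4*Re[(Ax) conj(Ar)], simulated below with an assumed complex Gaussian transmission matrix. This illustrates the general idea and is not claimed to be the paper's exact measurement scheme.

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_out = 512, 2_048

    A = (rng.standard_normal((d_out, d_in)) + 1j * rng.standard_normal((d_out, d_in))) / np.sqrt(2)
    r = rng.standard_normal(d_in)            # fixed reference pattern displayed alongside x

    def intensity(v: np.ndarray) -> np.ndarray:
        """What the camera records: only |A v|^2, never the complex field A v itself."""
        return np.abs(A @ v) ** 2

    def linear_projection(x: np.ndarray) -> np.ndarray:
        """Combine two intensity measurements into a signed readout that is linear in x."""
        return (intensity(x + r) - intensity(x - r)) / 4.0

    # Sanity check of linearity: projecting a sum equals the sum of the projections.
    x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)
    lhs = linear_projection(x1 + x2)
    rhs = linear_projection(x1) + linear_projection(x2)
    print(np.allclose(lhs, rhs))             # True: the absolute-square non-linearity is removed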

    Changes in complex spike activity during classical conditioning

    The cerebellar cortex is necessary for adaptively timed conditioned responses (CRs) in eyeblink conditioning. During conditioning, Purkinje cells acquire pause responses or "Purkinje cell CRs" to the conditioned stimulus (CS), resulting in disinhibition of the cerebellar nuclei (CN), allowing them to activate motor nuclei that control eyeblinks. This disinhibition also causes inhibition of the inferior olive (IO) via the nucleo-olivary (N-O) pathway. Activation of the IO, which relays the unconditioned stimulus (US) to the cortex, elicits characteristic complex spikes in Purkinje cells. Although Purkinje cell activity, as well as stimulation of the CN, is known to influence IO activity, much remains to be learned about how learned changes in simple spike firing affect the IO. In the present study, we analyzed changes in simple and complex spike firing in extracellular Purkinje cell recordings from the C3 zone of decerebrate ferrets undergoing training in a conditioning paradigm. In agreement with the N-O feedback hypothesis, acquisition resulted in a gradual decrease in complex spike activity during the conditioned stimulus, with a delay that is consistent with the long N-O latency. Also supporting the feedback hypothesis, training with a short interstimulus interval (ISI), which does not lead to acquisition of a Purkinje cell CR, did not cause a suppression of complex spike activity. In contrast, the observation that extinction did not lead to a recovery in complex spike activity, together with the irregular patterns of simple and complex spike activity after the conditioned stimulus, is less conclusive.

    Bidirectional plasticity of Purkinje cells matches temporal features of learning

    Many forms of learning require temporally ordered stimuli. In Pavlovian eyeblink conditioning, a conditioned stimulus (CS) must precede the unconditioned stimulus (US) by at least about 100 ms for learning to occur. Conditioned responses are learned and generated by the cerebellum. Recordings from the cerebellar cortex during conditioning have revealed CS-triggered pauses in the firing of Purkinje cells that likely drive the conditioned blinks. The predominant view of the learning mechanism in conditioning is that long-term depression (LTD) at parallel fiber (PF)-Purkinje cell synapses underlies the Purkinje cell pauses. This raises a serious conceptual challenge because LTD is most effectively induced at short CS-US intervals, which do not support acquisition of eyeblinks. To resolve this discrepancy, we recorded Purkinje cells during conditioning with short or long CS-US intervals. Decerebrated ferrets trained with CS-US intervals ≥150 ms reliably developed Purkinje cell pauses, but training with an interval of 50 ms unexpectedly induced increases in CS-evoked spiking. This bidirectional modulation of Purkinje cell activity offers a basis for the requirement of a minimum CS-US interval for conditioning, but we argue that it cannot be fully explained by LTD, even when previous in vitro studies of stimulus-timing-dependent LTD are taken into account.